fix: restore per-cycle span duration for execute_event_loop_cycle#1939
Di-Is wants to merge 3 commits into strands-agents:main
Conversation
PR strands-agents#1293 wrapped event_loop_cycle() in use_span(end_on_exit=True) and removed explicit span.end() calls. Because event_loop_cycle is an async generator, yield keeps the context manager open across recursive cycles, causing all execute_event_loop_cycle spans to share the same OTel end_time. Switch to end_on_exit=False and explicitly call span.end() via _end_span() in end_event_loop_cycle_span() and end_model_invoke_span(), restoring end_span_with_error() in all exception paths.
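The failure mode is easy to reproduce with a toy version of the pattern. The sketch below (illustrative names only, not the SDK's actual code) wraps a recursive async generator's body in a context manager that records its "end time" on exit, mirroring `use_span(end_on_exit=True)`: start times are staggered per cycle, but every "span" ends in the same instant when the generator chain unwinds.

```python
import asyncio
import contextlib
import time

starts, ends = {}, {}

@contextlib.contextmanager
def fake_span(name):
    # Stand-in for OTel use_span(end_on_exit=True): the span "ends"
    # only when the context manager exits.
    starts[name] = time.monotonic()
    try:
        yield
    finally:
        ends[name] = time.monotonic()

async def cycle(depth):
    # Async generator whose body is wrapped in the span context manager,
    # mirroring the shape of event_loop_cycle().
    with fake_span(depth):
        await asyncio.sleep(0.05)  # this cycle's own work
        yield depth
        if depth < 2:              # recurse while still inside the `with`
            async for ev in cycle(depth + 1):
                yield ev

async def main():
    async for _ in cycle(0):
        pass

asyncio.run(main())
start_spread = max(starts.values()) - min(starts.values())
end_spread = max(ends.values()) - min(ends.values())
# Starts are ~50 ms apart, but all ends fire together during unwind.
print(start_spread > end_spread)  # True
```

With `end_on_exit=False` and an explicit `span.end()` at each cycle's logical end, each span's native end time would instead land at the end of its own work.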
fix: handle BaseException in trace spans to prevent span leaks on KeyboardInterrupt. Trace spans were not properly closed when a BaseException (e.g. KeyboardInterrupt, asyncio.CancelledError) was raised. Add explicit BaseException handlers to close spans and aclose() calls to ensure async generators are cleaned up.
Reduce overhead by limiting `force_flush` calls to agent span completion instead of every span end. Add a `flush` parameter to `_end_span()` with default `False`, passing `True` only from `end_agent_span()`.
@Di-Is thanks for this; going to take a look. Would it be possible to have a PR isolated to the span issue, and then spin the rest to a separate PR? It would make it easier to digest.
```diff
             flush: Force the tracer provider to flush after ending the span
         """
-        if not span:
+        if not span or not span.is_recording():
```
Issue: The check `not span.is_recording()` is a good defensive addition, but the docstring doesn't mention this guard condition.
Suggestion: Update the docstring to note that the method is a no-op if the span is not recording:
"""Generic helper method to end a span.
This method is a no-op if span is None or not recording.
Args:
...|
Re: Splitting the PR

I think this PR is appropriately scoped given that commits 1 and 2 are closely related: these are two halves of the same fix; separating them would leave the codebase in an incomplete state between merges. Commit 3 (the force_flush optimization) is indeed independent and could be split out if preferred, though it's a small, low-risk change. The author has structured commits well for independent review: each is reviewable on its own with clear purpose.
Review Summary

Assessment: Comment (with minor suggestions)

This PR correctly addresses a real regression in span timing for observability. The fix is well-reasoned: the async generator's `yield` keeps the `use_span` context manager open across recursive cycles, so spans must be ended explicitly.

Review Details

Strengths
Suggestions (Non-blocking)
Questions for Author

The force_flush optimization in commit 3: have you verified this doesn't impact span export reliability for short-lived agent invocations? The change seems reasonable but curious about your testing.

Overall this is a solid bug fix. Thank you for the detailed analysis and test coverage! 🎉
+1
Correction to my earlier review comment about exception handlers: after re-analyzing the code more carefully, I realize my suggestion to consolidate all the exception handlers was mistaken.

Looking at the three locations:
The design is intentional and correct. Only the first location (lines 166-171) could be simplified if desired, but it's a minor point.
Updated Review Summary

Assessment: Approve ✅

After re-reviewing with the latest changes and analyzing the exception handling patterns more carefully, this PR is well-designed.

Key Points

Exception Handling Design

The separate exception handlers are intentional rather than redundant.
Test Coverage

The tests comprehensively cover:
Minor Note

Only lines 166-171 could potentially be simplified (both handlers do the same thing), but this is a very minor point. The PR correctly fixes the span duration regression while handling all edge cases properly. Ready to merge! 🚀
Execution/run via Langfuse: https://us.cloud.langfuse.com/project/cmct6aysb00eaad07kq618eto/traces/1a462b3368a0e6a5a0b529551df069b4?timestamp=2026-03-31T18%3A06%3A35.626Z, showing the fixed times.
```diff
         )
-        self._end_span(span, attributes, error)
+        self._end_span(span, attributes, error, flush=True)
```
nit: `_end_span` is a shared helper used by all span types (model, tool, cycle, agent). The `flush` param only applies to the agent span, so it feels out of place threading it through this parameter. Could just call `force_flush()` directly in `end_agent_span` after the `_end_span` call and remove the span-level flush in `_end_span`.
i.e.

```python
def end_agent_span(self, span, response=None, error=None):
    ...
    self._end_span(span, attributes, error)
    if self.tracer_provider and hasattr(self.tracer_provider, "force_flush"):
        try:
            self.tracer_provider.force_flush()
        except Exception as e:
            logger.warning("error=<%s> | failed to force flush", e)
```
I think this threading makes sense for introducing minimal changes/disruptions in the codebase, but it does give me some pause on clarity/maintainability.
Looks like
I'm going to force push this to remove my commits and split out to a separate PR containing only the span fix |
Force-pushed ab54df7 to 1b7b6f2
Force-pushed 29eaae5 to 1b7b6f2
Hey @Di-Is, thanks for the fixes; I cherry-picked the duration fix to "fix: restore explicit span.end() to fix span end_time regression" by zastrowm · Pull Request #2032 · strands-agents/sdk-python and got it merged in. Sorry for all the chatter on this PR; to get the fix in sooner rather than later we had to split it out, so I reverted my changes to this PR. If you're willing, the other two commits would be good as standalone PRs.
Description
Since v1.24.0 (PR #1293), `execute_event_loop_cycle` spans no longer reflect per-cycle duration. When a cycle performs tool use and recurses, the parent cycle's native OTel span stays open until all recursive children complete, producing cumulative bottom-up latency instead of per-step latency in observability backends (Langfuse, Jaeger, etc.).

The root cause: `event_loop_cycle()` is an async generator whose body was wrapped in `use_span(end_on_exit=True)`. Because `yield` keeps the context manager open across recursive cycles, all `span.end()` calls fire simultaneously when the generator chain unwinds. The logical metadata (the `gen_ai.event.end_time` attribute) is set at the correct time, but backends use the native OTel `endTimeUnixNano` from `span.end()`.

Three commits, each reviewable independently:

1. fix: restore explicit span.end() to fix span end_time regression. Switches to `end_on_exit=False` and restores explicit `span.end()` calls in `end_event_loop_cycle_span()` and `end_model_invoke_span()`, with `end_span_with_error()` on exception paths.
2. fix: handle BaseException in trace spans to prevent span leaks on KeyboardInterrupt. With `end_on_exit=False`, spans must be explicitly closed on all paths; adds `except BaseException` handlers and `aclose()` for async generators to cover `KeyboardInterrupt` and `asyncio.CancelledError`.
3. perf: only force flush tracer provider when ending agent spans. `_end_span()` previously called `force_flush()` on every span end; this limits it to agent span completion only.

Resolves #1930, #1938
Related Issues

- `execute_event_loop_cycle` spans all share the same OTel end_time #1938 (duplicate, closed)

Documentation PR
N/A
Type of Change
Bug fix
Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

- `hatch run prepare`

Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.